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A BAYESIAN x 2 TEST FOR GOODNESS-OF-FIT 

By Valen E. Johnson 

University of Michigan 

This article describes an extension of classical \ 2 goodness-of-fit 
tests to Bayesian model assessment. The extension, which essentially 
involves evaluating Pearson's goodness-of-fit statistic at a parameter 
value drawn from its posterior distribution, has the important prop- 
erty that it is asymptotically distributed as a \ 2 random variable on 
K — 1 degrees of freedom, independently of the dimension of the un- 
derlying parameter vector. By examining the posterior distribution 
of this statistic, global goodness-of-fit diagnostics are obtained. Ad- 
vantages of these diagnostics include ease of interpretation, compu- 
tational convenience and favorable power properties. The proposed 
diagnostics can be used to assess the adequacy of a broad class of 
Bayesian models, essentially requiring only a finite-dimensional pa- 
rameter vector and conditionally independent observations. 

1. Introduction. Model assessment presents a challenge to Bayesian statis- 
ticians, one that has become an increasingly serious problem as computa- 
tional advances have made it possible to entertain models of a complexity 
not considered even a decade ago. Because diagnostic methods have not 
kept pace with these computational advances, practitioners are often faced 
with the prospect of interpreting results from a model that has not been 
adequately validated. 

Numerous solutions to this problem have been considered. The most or- 
thodox of these depend on the specification of alternative models and the use 
of Bayes factors for model selection. This approach is reasonable when both 
a relatively broad class of models can be specified as alternatives, and when 
implied Bayes factors can be readily computed. Unfortunately, it often hap- 
pens in practice that neither requirement is satisfied, making this approach 
impractical for routine application. Complicating the situation still further is 
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the fact that Bayes factors are not defined when improper priors are used in 
model specification, although this difficulty may be partially circumvented 
through the use of intrinsic Bayes factors or related devices [e.g., Berger and 
Pericchi (1996) and O'Hagan (1995)]. 

A second strategy for assessing model adequacy centers on the use of 
posterior-predictive model checks. This approach was initially proposed by 
Guttman (1967) and Rubin (1984), and was extended to more general dis- 
crepancy functions by Gelman, Meng and Stern (1996). [Gelfand (1996) 
has advocated related techniques based on cross-validatory predictive den- 
sities.] The primary advantage of posterior-predictive model assessment is 
its relative ease of implementation. In many models, the output from nu- 
merical algorithms used to generate samples from the posterior distribution 
can be used to generate observations from the predictive model, which in 
turn can be used to compute p-values for the discrepancy function of inter- 
est. Posterior-predictive model assessment also facilitates case-diagnostics, 
which, in many circumstances, are more telling in examining model fit than 
are global goodness-of-fit statistics. However, such approaches also have an 
important disadvantage. As Bayarri and Berger (2000) and Robins, van 
der Vaart and Ventura (2000) and others have noted, they do not pro- 
duce p-values that have (even asymptotically) a uniform distribution. Be- 
cause output from predictive posterior model checks is not calibrated, using 
p-values based on them for model assessment is problematic. 

Bayarri and Berger (2000) and Robins, van der Vaart and Ventura (2000) 
propose alternative distributions under which p- values, and thus model diag- 
nostics, can be calculated. These include partial posterior-predictive p- values 
and conditional predictive p- values [Bayarri and Berger (2000)], and modifi- 
cations to posterior predictive and "plug- in" p- values [Robins, van der Vaart 
and Ventura (2000)]. The attractive feature of each of these variations on 
more standard definitions of p- values is that these statistics are distributed 
either as U(0,1) random variables, or approach U(0, 1) random variables 
as sample sizes become large. Their drawback is that they can seldom be 
defined and calculated in realistically complex models. 

The goal of this article is to present a goodness-of-fit diagnostic that 
bridges the gap between diagnostics that are easy to compute but whose 
null distributions are unknown, and diagnostics whose null distributions are 
known but that cannot generally be computed. The proposed diagnostic is 
closely related to the classical x 2 goodness-of-fit statistic, whose properties 
are now briefly reviewed. 

In the case of a point null hypothesis, the standard x 2 statistic may be 
defined as 
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where represents the number of observations observed within the Arfch par- 
titioning element, pk the probability assigned by the null model to this inter- 
val, K the number of partitions or intervals specified over the sample space 
and n the sample size. For independent and identically distributed data sat- 
isfying certain regularity requirements, Pearson (1900) demonstrated that 
the asymptotic distribution of R° was x 2 with K — 1 degrees of freedom. 

The situation for composite hypotheses is more complicated. Assuming 
that bins are determined a priori, Cramer [(1946), pages 426-434] demon- 
strated that the distribution of 

R g _ Y - UP ^ 2 
fc =i n Pk 

is that of a x 2 random variable with K — s — 1 degrees of freedom, where 
s denotes the dimension of the underlying parameter vector 6 and {pf.} de- 
note estimates of the bin probabilities based on either maximum likelihood 
estimation for the grouped data or on the minimum x 2 method. Maximum 
likelihood estimation for the grouped data implies maximization of the func- 
tion 
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with respect to 6, while minimum x 2 estimation involves the determination 
of a value of 6 that minimizes a function related to R 9 . 

The statistic R 9 is the form of the x 2 test most often used in statistics, 
where it is routinely used to test independence in contingency tables [see, 
e.g., Fienberg (1980)]. In that context, grouped maximum likelihood esti- 
mation is natural. Although the Bayesian x 2 statistic proposed below can 
be extended for testing independence in contingency tables, this is not its 
intended purpose. Instead, it is intended primarily for use as a goodness- 
of-fit test. In this regard, the aspect of model fit assessed is similar to that 
examined using the classical x 2 goodness-of-fit test; namely, the proportion 
of counts observed in predefined parcels of the sample space is compared to 
the proportion of counts that are expected in these parcels under a specified 
probability model. 

Chernoff and Lehmann (1954) considered the distribution of the x 2 statis- 
tic in the more typical situation in which values of the bin probabilities 
are based on maximum likelihood estimates obtained using the raw (un- 
grouped) data. Denote these values by pk- In this case, the distribution of 
the goodness-of-fit statistic is generally not one of a x 2 distribution, but 
instead produces a value R that has a distribution that falls stochastically 
between R° and R 9 . For models containing many parameters, the gap be- 
tween the degrees of freedom associated with these two statistics is large, 
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and, as a result, the x 2 goodness-of-fit test based on the maximum likelihood 
estimate is usually not useful for assessing model fit in high-dimensional set- 
tings. 

The goodness-of-fit statistic proposed here represents a modification of 
the x 2 statistics considered above. The modification, denoted by R B {6) (or 
more simply, by R B when no confusion arises), is obtained by fixing the 
values of and instead considering the bin counts as random quan- 
tities. Allocation of observations to bins is made according to the value of 
each observation's conditional distribution function, conditionally on a sin- 
gle parameter value 9 sampled either from the posterior distribution or the 
asymptotic distribution of the maximum likelihood estimator. [The statis- 
tic obtained in this way has some resemblance to the x 2 statistics consid- 
ered by, e.g., Moore and Spruill (1975), although emphasis there focuses on 
randomized cells rather than on posterior sampling of parameter vectors.] 
The distinguishing feature of R B {6) is that, for many statistical models, its 
asymptotic distribution \s on K — \ degrees of freedom, independently of 
the dimension of the parameter vector 6. 

Because it is the sampling distribution of R B that has a x 2 distribution, 
one might argue that this procedure does not really represent a Bayesian 
goodness-of-fit diagnostic. However, sampling parameter values from a dis- 
tribution for the purpose of inference occurs more naturally within the 
Bayesian paradigm, and for this reason it is likely that the proposed di- 
agnostic will find more application there. In addition, the formal test statis- 
tics proposed below are based on the posterior distribution of R B . For this 
reason, values of 6 used in the definition of R B are assumed to represent 
samples from the posterior distribution on the parameter vector, rather than 
samples generated from the asymptotic normal distribution of the maximum 
likelihood estimator. However, either interpretation is valid. 

The remainder of the paper is organized as follows. In the next section, 
the Bayesian x 2 statistic R B is defined and its asymptotic properties are 
described. Corollaries extending these properties from i.i.d. observations to 
conditionally independent observations and to fixed-bin applications are pre- 
sented, and strategies for combining information contained in dependent 
samples of R B values generated from the same posterior distribution are 
described. Following this, several examples that illustrate the application 
of this statistic and summaries from its posterior are presented. Discussion 
and concluding remarks appear in Section 4. Proofs of the theorem and 
corollaries of Section 2 appear in the Appendix. 

2. A Bayesian x 2 statistic. To begin, let yi,---,y n (= y) denote scalar- 
valued, continuous, identically distributed, conditionally independent obser- 
vations drawn from probability density function f(y\6) defined with respect 
to Lebesgue measure and indexed by an s-dimensional parameter vector 
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€ & C R s . Denote by F(-\0) and F~ 1 (-\0) the (nondegenerate) cumulative 
distribution and inverse distribution functions corresponding to f(-\0). To 
construct a sampled value from the posterior, augment the observed sam- 
ple y with an i.i.d. sample vi,...,v s from a U(0, 1) distribution. Let p(0\y) 
denote the posterior density of based on y, and let p(0j \0±, . . . , 0j_i,y) de- 
note the marginal conditional posterior density of 0j given (Ox, . . . , 0j-x,y)- 
Define implicitly by 

(1) vx = [ 6l p(ex\y)detx,...,v s = f S p(9 s \9 1 ,...,9 s . 1 ,y)d9 s . 

J — oo J — oo 

Thus, denotes a value of sampled from the posterior distribution based 
on y. Let 0q denote the true but unknown value of 0. The maximum likeli- 
hood estimate of is denoted by 0. 

To construct the Bayesian goodness-of-fit statistic proposed here, choose 
quantiles = ao < ax < • • • < an-i < = L withpfc = a^ — a^-i, k = 1, . . . , K. 
Define Zj(0) to be a vector of length K whose fcth element is unless 

(2) F( yj \0)e(a k _x,a k ], 
in which case it is 1. Finally, define 

n 

m(0) = XX0). 

i=i 

It follows that the fcth component of m(#), mk(0), represents the number 
of observations that fell into the fcth bin, where bins are determined by the 
quantiles of the inverse distribution function evaluated at 0. Finally, define 



(3) R B {0) = Y, 



(m k (0) - np fc ) l 2 

The asymptotic distribution of R B is provided in the following theorem. 



k=l 



Theorem 1. Assuming that the regularity conditions specified in the 
Appendix apply, R B converges to a x 2 distribution with K — 1 degrees of 
freedom as n — > oo . 

The simplicity of Theorem 1 is somewhat remarkable given the complexity 
of the corresponding distribution of R. As mentioned above, the asymptotic 
distribution of R does not, in general, follow a x 2 distribution. Instead, it has 
the distribution of the sum of a x 2 random variable with K — s — 1 degrees 
of freedom and the weighted sum of s additional squared normal deviates 
with weights ranging from to 1. In contrast, the asymptotic distribution 
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of R B follows a Xk-i distribution, independently of the dimension of the 
parameter vector 6. 

Heuristically, the idea underlying Theorem 1 is that the degrees of freedom 
lost by substituting the grouped MLE for 6 in Pearson's \ 2 statistic are 
exactly recovered by replacing the MLE with a sampled value from the 
posterior in R B . That is, the s degrees of freedom lost by maximizing over the 
grouped likelihood function to obtain R 9 are exactly recovered by sampling 
from the s-dimensional posterior on 6. 

As a corollary, Theorem 1 can be extended to the more general case in 
which the functional form of the density f(y\0) varies from observation to 
observation. Specifically, if the density of the jib. observation is denoted by 
fj(y\6), with distribution and inverse distribution functions Fj and F~ , 
respectively, then the following corollary also applies. 

Corollary 1. Assume the conditions referenced in Theorem 1 are ex- 
tended so as to provide also for the asymptotic normality of both the posterior 
distribution on 6 and the maximum likelihood estimator when the likelihood 
function is proportional to 

n 

T[fM\°)- 

3=1 

Assume also that the functions fj(-\0) satisfy the same conditions implied 
in Theorem 1 for f{-\6). Define the kth component of Zj(6) to be 1 or 
depending on whether or not 

(4) Fj(yj\0)€(a k - u a k ], 

with a fixed. Then the asymptotic distribution of R B based on this revised 
definition of Zj(0) is y 2 with K — 1 degrees of freedom. 

Outlines of the proof of Theorem 1 and the corollary appear in the Ap- 
pendix. 

From a practical perspective, the corollary is important because it extends 
the definition of R B to essentially all models in which observations are con- 
tinuous and conditionally independent given the value of a finite-dimensional 
parameter vector. 

The results cited above for continuous-valued random variables can be 
extended to discrete random variables in one of two ways. The most direct 
extension is to simply proceed as in the continuous case, using a random- 
ization procedure to allocate counts to bins when the mass assigned to an 
observation spans the boundaries defining the bins. The second is to define 
fixed bins in the standard way based on the possible outcomes of the random 
variable, and to then evaluate the bin probabilities at sampled values of 6 
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from the posterior distribution. That is, if f(y\0) denotes the probability 
mass function of a discrete random variable y and 

1 n 

(5) p*(*) = -£ E /;(y|0)> 

j'=l j/Gbin A; 

then the y 2 statistic i? B may be redefined as 

(6) R B (0) = J2 

k=l 

In this case, the asymptotic distribution of R B (6) is similar to that described 
above in the continuous case and is detailed in the following corollary. 



(m k - np k {6)) 

\/nr>,Jfl\ 



Corollary 2. If the regularity conditions specified in Theorem 1 apply 
to the discrete probability mass function f(y\6), then, using predefined bins 
and the definition of the bin probabilities given in (5), the distribution of 
R B {6) as defined in (6) converges to a x 2 distribution with K — 1 degrees 
of freedom as n — > oo . 

The asymptotic x 2 distribution of R B {6) described in the theorem and 
corollaries above is achieved when a large sample of independent observa- 
tions is drawn from a sampling density, and a value of 9 is drawn from the 
posterior induced by this observation. However, when two values of 6 are 
drawn from the same posterior distribution (i.e., based on the same obser- 
vation), the values of R B that result are correlated. This correlation com- 
plicates the interpretation of test statistics defined with respect to posterior 
distribution on R B values. 

Combining information across a posterior sample of R B values might be 
accomplished in a variety of ways, including modifications of the methodolo- 
gies proposed in Verdinelli and Wasserman (1998) or Robert and Rousseau 
(2002). Another possibility is to simply report the proportion of R B values 
drawn from the posterior distribution that exceeds a specified critical value 
from their nominal x\-\ distribution. For a given data vector and prob- 
ability model, such a procedure might lead to a statement that, say, 90% 
of R B values generated from the posterior distribution exceeded the 95th 
percentile of the reference x 2 distribution. 

Though decidedly non-Bayesian, such a report is convenient from several 
perspectives. By reporting the proportion of R B values that exceeds the 
critical value of the test, the unpalatable aspect of basing a goodness-of-fit 
test on a randomly selected value of R B is avoided. It is also straightforward 
to compare the proportion of R B values that exceeds the critical value of the 
test to the size of the test; if the R B values did represent independent draws 
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from their nominal x 2 distribution, the proportion of values falling in the 
critical region of the test would exactly equal the size of the test. Any excess 
in this proportion must therefore be attributed either to dependence between 
the sampled values of R B from the given posterior or lack of fit. Finally, and 
perhaps most importantly, this strategy requires almost no computational 
effort. In most practical Bayesian models, values of R B can be computed 
almost as an afterthought within the MCMC schemes used to sample from 
the posterior distribution of the parameter vector. 

In the event that formal significance tests must be performed to assess 
model adequacy, they can be based on a comparison of the observed value of 
a summary statistic based on the posterior distribution of R B values to an 
approximation of the sampling distribution of the summary statistic induced 
by repeated sampling of the data vector. The summary statistic considered 
here is defined as the posterior probability that a value of R B drawn from 
the posterior distribution (based on a single value of y) exceeds the value 
of a Xk-i random variable. This probability, denoted by A, is related to a 
commonly used quantity in signal detection theory and represents the area 
under the ROC curve [e.g., Hanley and McNeil (1982)] for comparing the 
joint posterior distribution of R B values to a x\-\ random variable. The 
expected value of A, if taken with respect to the joint sampling distribution 
of y and the posterior distribution of given y, would be 0.5. Large devi- 
ations in the expected value of A from 0.5, when the expectation is taken 
with respect to the posterior distribution of 6 for a fixed value of y, indicate 
model lack of fit. 

Unfortunately, approximating the sampling distribution of A is a numer- 
ically burdensome endeavor, and calculating it obviates many of the ad- 
vantages that are gained by using a test statistic with a known reference 
distribution. To a large extent, the computations required to approximate 
^4's sampling distribution are as complicated as, or even more complicated 
than, similar techniques used to approximate the sampling distribution of 
discrepancy functions used in posterior-predictive model checks [e.g., Sin- 
haray and Stern (2003)]. However, knowing the nominal value of A makes 
this computation unnecessary when the observed value of A falls within sev- 
eral hundredths of 0.5 or is smaller than 0.5. Procedures for approximating 
the sampling distribution of A for the purpose of determining the signifi- 
cance of departures of the observed value of A from 0.5 are described in 
the examples using methodology delineated by Dey, Gelfand, Swartz and 
Vlachos (1998). 

As an aside, it is interesting to compare the test statistic R B and its 
reference distribution to the x 2 discrepancy function and its reference dis- 
tribution as proposed in Gelman, Meng and Stern (1996). The reference 
distribution of R B {6) is obtained by sampling y from its "true" distribution 
F(-\6q), an d then sampling a single value of from the posterior distribution 



A BAYESIAN X 2 TEST FOR GOODNESS-OF-FIT 



9 



of given y. The resulting distribution is asymptotically Xx-ii this resu lt is 
unrelated to posterior-predictive distributions or samples drawn from them. 
In contrast, the reference distribution of the x 2 discrepancy function pro- 
posed by Gelman, Meng and Stern (1996) is obtained as the distribution of 
the statistic 



induced by repeatedly drawing values y pp = (yf , • ■ • , from the posterior- 
predictive density based on the observed data vector y. As Gelman, Meng 
and Stern point out, this statistic does not have a x 2 distribution. 

The power characteristics of the Bayesian x 2 statistics defined above, like 
their classical counterparts, depend on the selection of the bin probabilities 
Pk . Clearly, consistency of derived tests against general alternatives requires 
that K — > co as n — > oo. On the other hand, as many authors have noted 
[see, e.g., Koehler and Gan (1990) for a review of this topic], using too many 
cells can result in a significant loss of power. 

A general criterion for choosing cell probabilities was proposed by Mann 
and Wald (1942), who suggested the use of 3.8(n — l) 0-4 equiprobable cells. 
Subsequent authors [e.g., Williams (1950), Watson (1957), Hamdan (1963), 
Dahiya and Gurland (1973), Gvanceladze and Chibisov (1979), Best and 
Rayner (1981), Quine and Robinson (1985) and Koehler and Gan (1990)] 
found that the Mann-Wald criterion often results in too many bins and 
loss of power. Based on numerical simulations of seven classes of alternative 
probability models, Koehler and Gan (1990) noted that near-optimal power 
against a Gaussian null model was obtained when the Mann-Wald crite- 
rion was divided by 4. Such a rule also finds approximate agreement with 
simulation results reported by Kallenberg, Oosterhoff and Schriever (1985) 
(although they also recommend the use of nonequiprobable cells against 
certain types of alternative hypotheses). This rule of thumb, which may be 
approximately reformulated as taking n 0A equiprobable cells, was found to 
yield nearly optimal results in the examples described below. 

3. Examples. 

3.1. Goodness- of- fit tests under a normal model with unknown mean and 
variance. In this example, the distribution of R B under a normal model 
is investigated and compared with the distributions of R and R 9 . Posterior 
samples of R B generated from a single data vector are used in ROC-type 
analyses to generate a summary model diagnostic. The power of this test 
statistic is investigated and compared to the power of the test statistic R 9 
when data are generated under nonnormal alternatives. 



(7) 



E 



(yT-E( y m? 

Var(yf|0) 



10 



V. E. JOHNSON 



Let y = (yi, . . . , y^o) denote a random sample from a standard normal dis- 
tribution. For purposes of illustration, assume that the mean fj, and variance 
a 2 of the data are unknown and that the joint prior assumed for (/x, a) is 
proportional to 1/a. Let (/^cr) denote a sampled value from the posterior 
distribution based on y. 

For a given data vector y and posterior sample (fx, a), bin counts rrik(fx, a) 
are determined by counting the number of observations yi that fall into the 
interval (cr$ _1 (afc_i) + fx,a$~ + A)) where < & _1 (-) denotes the standard 
normal quantile function. Based on these counts, R B (fx,a) is calculated ac- 
cording to (3). 

Figure 1 depicts a quantile-quantile plot of R B values calculated for 10,000 
independent samples of y . Each value of R B depicted in this plot corresponds 
to a single draw of (jx, a) from the posterior distribution based on a single 
observation vector y. In accordance with the rule-of-thumb discussed in Sec- 
tion 2, five equiprobable bins were used in the definition of R B . As expected, 
the distribution of R B closely mimics that of a xl random variable. 

The normal deviates used in the construction of Figure 1 were also used 
to compute the classical x 2 statistic based on the maximum likelihood es- 
timates of fx and a (i.e., using the ungrouped data). The quantile-quantile 
plot of 10,000 R values obtained from these data is displayed in Figure 2. In 
this plot, the R values have been plotted against the expected order statis- 
tics from a xl random variable. Five equal probability bins based on the 
standard normal distribution were also used to define these R values. As 
might be expected, the plotted x 2 values display some deviation from their 
approximate xl distribution. 

Grouped maximum likelihood estimates were also used to calculate R 9 
values using these normal samples. The corresponding quantile-quantile plot 
for the 10,000 R 9 values is displayed in Figure 3; as expected, these values 
demonstrate substantially better agreement with a x| random variable than 
do the values depicted in Figure 2. 

Returning to the investigation of the properties of R B , Figure 1 demon- 
strates excellent agreement between this statistic and its asymptotic dis- 
tribution. To illustrate its power in detecting departures from the normal 
model, suppose now that the experiment above is repeated with independent 
Student t variates substituted for the normal deviates. That is, the actual 
observation vectors used in the simulation represent Student t variates, but 
the statistical model used to calculate values of R B is still based on the 
assumption that the data are normally distributed. The degrees of freedom 
of the t variates used in this experiment range from 1 to 10, and for each 
value within this range, 10,000 independent samples of size 50 were drawn. 

To study the power of the statistic R B in detecting departures from nor- 
mality in this experiment, formal significance tests were performed using the 
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Fig. 1. Quantile-quantile plot of R B values for i.i.d. normal data. Values of R B displayed 
in this plot were determined from independent samples of 50 standard normal deviates, and 
are plotted against the expected order statistics from a xl distribution. Posterior samples 
of the mean and variance were estimated using reference priors and observations were 
binned into five bins of equal probability [i.e., a= (0,0.2,0.4,0.6,0.8,1)]. 

statistic A described in Section 2. This statistic may be denned formally as 

(8) A = Pr~ ely (R B (0)>X), X-xl-i, 

and, in repeated sampling of both y and given y, has a nominal value of 
0.5. Numerically, the value of A, for a fixed data vector y, can be approxi- 
mated in a straightforward way using Monte Carlo integration. 

Formal model assessment using the statistic A can be based on approxi- 
mating the sampling distribution of A using "posterior-predictive-posterior" 
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Expected order statistics 

Fig. 2. Quantile-quantile plot of R values for i.i.d. normal data. Values of R displayed in 
this plot were each determined from a separate sample of 50 standard normal deviates, and 
are plotted against the expected order statistics from a xl distribution. For comparison, 
the top curve depicts values of expected order statistics from a xl distribution. 

model checks [e.g., Dey, Gelfand, Swartz and Vlachos (1998)]. That is, sam- 
pled values 6 from the posterior can be used to generate posterior-predictive 
observations y pp according to f(-\0). In large samples, values of will be 
close to 0o, and so the distribution of y pp will be close to the distribution 
of y. Posterior-predictive-posterior values of A pp can be generated for each 
value of y pp by averaging R B , computed from y pp , over the posterior dis- 
tribution of induced by y pp . Values of obtained from this procedure 
approximate the sampling variability of the summary test statistic A that 
can be attributed to computing the probability in (8) using the posterior 
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Fig. 3. Quantile-quantile plot of R 9 values for i.i.d. normal data. Values of R 3 displayed 
in this plot were each determined from a separate sample of 50 standard normal deviates, 
and are plotted against the expected order statistics from their asymptotic xl distribution. 

distribution of 6 for a given value of y, without averaging over the distribu- 
tion of y. The value of A obtained for the original data vector can then be 
compared to the empirical distribution of the values of A pp obtained from 
the posterior distribution on the posterior-predictive data. 

In principle, exactly this procedure can be implemented to calculate the 
probability that the test statistic A, based on a random sample of t variates, 
falls into the critical region of a test based on the empirical distribution of 
sampled values A pp . In this case, however, it is not necessary to generate 
values of A pp for each sample of t variates. Under the normal model, val- 
ues of R B are invariant to shifts in location and scale of the data, so the 
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sampling distribution of A, for any future draw of 50 i.i.d. normal deviates, 
can be approximated by the empirical distribution of A values obtained un- 
der the normal sampling scheme used at the beginning of this example. It 
follows that critical regions for significance tests based on A are exact un- 
der this model, save for the Monte Carlo error encountered in the empirical 
approximation of their distribution. 

Figure 4 displays the proportion of times in 10,000 draws of t samples 
that the value of the test statistic A was larger than the 0.95 quantile of the 
sampled values of A pp . For comparison, the observed power of the test based 
on the grouped-maximum-likelihood x 2 statistic R 9 at the 5% level is also 
shown, as is the observed power obtained using a randomized test based on 
only a single value of R B . To facilitate comparison with the distribution of 
R B , five equiprobable bins from a standard normal distribution were used 
in the definition of R g . 




2 4 6 8 10 

degrees of freedom 

Fig. 4. Power of test statistics A, R B and R 9 in detecting departures from normality 
when data are distributed according to t distributions. The uppermost curve depicts the 
power of the test statistic A against t alternatives with degrees of freedom displayed on the 
horizontal axis. The curve in the middle depicts the corresponding power of a single value 
of R B when compared to a xa distribution. The curve at the bottom of the plot represents 
the power of R 3 against the t alternatives. All values of the power refer to the power of 
the test statistics in rejecting the null hypothesis of normality in significance tests of size 
0.05 and samples of size 50. 
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From Figure 4, it is clear that the test statistic A offers substantially better 
power than R 9 against this class of alternative models. Part of this advantage 
stems from the symmetry and unimodality of the alternative hypotheses, 
which RP is ill-equipped to accommodate, and part from the fact that the 
bins used in the definition of R 9 were fixed according to the null hypothesis. 
Substantially better power could be obtained by using the test statistic R 
with bins based on the particular y vector observed, but such tests do not 
achieve their nominal levels of significance. Perhaps surprisingly, the power 
of a randomized test based on a single value of R B is comparable to the 
power of A based on the complete posterior distribution of R B values. 

3.2. Lip cancer data. Spiegelhalter, Best, Carlin and van der Linde (2002) 
describe a re-analysis of lip cancer incidence data originally considered by 
Clayton and Kaldor (1987). Their purpose in examining these data was to 
illustrate the use of the deviance information criterion (DIC) to select from 
among five potential models for the number of lip cancer cases, j/j, observed 
in 56 Scottish districts as a function of available age and sex adjusted ex- 
pected rates Ei . These data and models are reconsidered here for the related 
purpose of assessing which of the models provides an adequate probabilistic 
description of the data. 

Following the Spiegelhalter et al. analysis of these data, begin by assum- 
ing that iji is Poisson with mean /Xj = exp(0j)i£j. Five models for 9i are 
considered: 

1. 6i = cxq, ao a constant. 

2. 0i = oiq + ji, ji exchangeable random effects. 

3. 9i = ao + Si, 5i spatial random effects with a conditional autoregressive 
prior [e.g., Besag (1974)]. 

4. 6i = ao + Si + 7i, Si and ji as above. 

5. 6i = cti, cti uniform on (—00,00). 

Five thousand, thinned posterior samples of fi = {/ij} were generated for 
each of these models using WinBUGS code [Spiegelhalter, Thomas and Best 
(2000)] kindly provided by Dr. Best. For each sampled value of /u,i, the Pois- 
son counts yi were assigned to one of five equiprobable bins defined according 
to the Poisson distribution function evaluated at yi for the given value of m . 
In those cases for which the probability mass function assigned to fji spanned 
more than one bin, allocation to a single bin was performed randomly ac- 
cording to the proportion of mass assigned to the bins. Averaging over all 
posterior samples of for a given model yielded the values of A depicted in 
Table 1. Because 56 data points were available, five bins were again used in 
the definition of the individual values of R B . The proportion of R B values 
exceeding the 95th quantile from a xl distribution was computed using the 
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Table 1 

Values of the goodness-of-fit statistic A and the proportion of 
critical R B values for models of lip cancer incidence data 



Model 


A 


Proportion of R B > 9.49 


DIC 


1 


0.999 


1.000 


382.7 


2 


0.517 


0.055 


104.0 


3 


0.538 


0.076 


89.9 


4 


0.537 


0.075 


89.7 


5 


0.677 


0.198 


111.7 



The second column provides the value of the summary statistic A 
achieved for each model. The third column lists the proportion of poste- 
rior samples of R B that exceeded the 95th quantile of a x| distribution 
for each model. DIC values obtained under the "mean" parameteriza- 
tion are listed for comparison. 



posterior sample fi. No posterior-predictive or posterior-predictive-posterior 
computations were performed to obtain these values. 

In Table 1, both the large value of A and the large proportion of R B values 
exceeding the 95th quantile of the x\ distribution provide a clear indication 
of lack of fit for the first model. Lack of fit in this instance can be attributed 
to the failure of the model to adjust for district effects; the posterior mean of 
the number of counts assigned to the five bins was (16.0,4.9,5.2,7.1,22.8). 

The values of A and proportions of extreme values of R B reported in rows 
2-4 do not suggest lack of fit of the aspect of these models being tested by 
the x 2 test. 

The most interesting row in Table 1 is the last, which corresponds to 
fitting a separate Poisson model for each observation. The value of A for this 
model is 0.68, and nearly 20% of R B values generated from its posterior — 
nearly four times the number expected — exceeded the 5% critical value from 
the x\ distribution. 

At first glance, one might suspect that these suspicious values arise from 
overfitting. However, the last model generates the most dispersed posterior 
distribution of any of the models considered, since only one observation 
figures into the marginal posterior of each /Xj. Instead, the difficulty with 
this model arises from the prior assumptions made on /x. The assumption of 
a uniform prior on 9i implies a prior for the mean of each Poisson observation 
proportional to 1//U.;; this prior shrinks the estimate of every fti toward 0. 
This results in an overabundance of counts in the higher bins and larger than 
expected values of R B . The posterior mean of the bin counts for this model 
was (8.4,9.8, 10.9, 12.1, 14.8). Refitting the fifth model with noninformative 
priors proportional to 1/ ' ^/Jm yielded a value of A = 0.501 and only 4.7% of 
R B values exceeding 9.49. 
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It is also interesting to compare the values in the second and third columns 
of this table with those provided for the DIC. All statistics suggest inade- 
quacy of the first model, though for different reasons. For the first model, 
the high values of A and R B suggest that the data do not follow Poisson 
distributions with a common scaling of adjusted expected rates. The value 
of the DIC statistic suggests either that the model does not fit the data or 
that it is not as precise in predicting the data as the other models consid- 
ered. An advantage of the x 2 statistics in this case is that their values are 
interpretable without fitting alternative models. 

The comparatively large value of the DIC statistic for the second model 
can be attributed to greater dispersion in its posterior as compared to poste- 
rior dispersion of the third and fourth models, even though the exchangeable 
model appears to adequately represent variation in the observed data. The 
comparatively large value of DIC reported for the fifth model reflects some 
combination of lack of fit and a posterior that is more dispersed than others 
considered. 

4. Extensions. In addition to providing a convenient mechanism for as- 
sessing model adequacy, values of R B generated from a posterior distribution 
may prove useful both as a convergence diagnostic for MCMC algorithms 
and for detecting errors written in computer code to implement these algo- 
rithms. 

Monitoring values of R B generated within an MCMC algorithm provides 
a rudimentary convergence diagnostic for slow-mixing chains. In fact, ex- 
ceedances of R B over a prespecified quantile from its null distribution can be 
incorporated formally into the convergence diagnostics proposed in Raftery 
and Lewis (1992). To the extent that such exceedances are adequately de- 
scribed by a two-state Markov chain, the use of R B in this context eliminates 
the requirement to assess convergence on a parameter- by-parameter basis, 
as is normally done in Raftery and Lewis' diagnostic scheme. It also provides 
a natural mechanism for determining whether burn-in has occurred. 

A less obvious but perhaps equally important use of the R B statistic in- 
volves code verification. Many practitioners currently fit models using cus- 
tomized code written for their specific application, a practice that frequently 
results in coding errors that are difficult to detect. This problem can be 
largely overcome by simply monitoring the distribution of R B , which, in my 
experience, tends to deviate substantially from its null distribution when a 
model has been misspecified or miscoded. 

5. Discussion. Goodness-of-fit tests based on the statistic R B provide a 
simple way of assessing the adequacy of model fit in many Bayesian models. 
Essentially, the only requirement for their use is that observations be condi- 
tionally independent. From a computational perspective, such statistics can 
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be calculated in a straightforward way using output from existing MCMC 
algorithms. 

Approximating the sampling distribution of A, though conceptually straight- 
forward, does introduce an additional computational burden, but is neces- 
sary only when the achieved value of A is "significantly" larger than 0.5. 
Significance of A in this context has a natural interpretation in terms of the 
posterior probability that a sampled value of R B exceeds a random variable 
drawn from its nominal x 2 distribution. In this regard, values of A that are 
close to 0.5 may indicate adequate model fit for the purposes of a given 
analysis even when the sampling distribution of A pp would permit rejection 
of the model in a significance test. 

Aside from applications in Bayesian model assessment, the \ 2 statistic 
proposed here can be extended, albeit somewhat awkwardly, to models es- 
timated using maximum likelihood. In that setting, parameter values can 
be sampled from their asymptotic normal distribution and used as if they 
were sampled from a posterior distribution. Although not entirely palatable 
from a classical perspective, such a procedure does provide a mechanism 
for conducting a (suboptimal) goodness-of-fit test for complicated models in 
which alternative tests may be difficult to perform. 

APPENDIX 

Outlines of proofs of theorems and corollaries. The proofs of Theorem 
1 and Corollary 1 are based largely on the proof given in Chernoff and 
Lehmann (1954) in establishing the asymptotic distribution of R. 

Assume that conditions specified in Cramer [(1946), pages 426 and 427] 
and Chen (1985) apply. Cramer specifies conditions that are sufficient for es- 
tablishing the distribution of the x 2 goodness-of-fit statistic when evaluated 
at the parameter vector maximizing the likelihood estimate based on the 
grouped data, whereas Chen's conditions are sufficient for establishing the 
asymptotic normality of the posterior distribution. Essentially, these con- 
ditions require that the likelihood be a smooth function of the parameter 
vector 9 in an open interval containing 6$ (the true value of 9), that the 
posterior distribution concentrate around a point in this interval, that the 
information contained in the observations increase with sample size and that 
the prior assign nonnegligible mass to the interval containing 6. In addition, 
assume that all third-order partial derivatives of f(y\6) [or, in the case of 
the corollary, fj(y\0)] with respect to the components of 6 exist and are 
bounded in an open interval containing 6q. Finally, note that all expecta- 
tions and statements regarding probabilistic orders of magnitude described 
below are computed under the sampling distribution of y given Oq. 

The following lemmas are needed. 
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Lemma A.l. Under the conditions stated above, if refers to the max- 
imum likelihood estimate of 0, 6 refers to a value of 9 sampled from the 
posterior distribution and m k (-) refers to the number of counts assigned to 
the kth bin at a specified value of 6, then 

(9) A=(m k {0) - m k (6)) = -±=(m%{d) - m£(0)) + o p (l) 

(10) = _Lg^) ft _ft) + ^(i), 

where 

ml(6)=nB[Ind(ye[F- 1 (a^ 1 \e),F- 1 (a k \e)])]. 

Proof. Expanding rrit(0) in a Taylor series expansion about m* k {6) 
yields 

(11) ml(6) - mJS(d) = ^^-(Oi ~ 0i) + O p (I 
Define 



d6i v * L ' P \n 
i=i 1 x 



Then 



\Az kJ \ < Inc% G [min^-^afc-il^.F-^afc-ild)), 

maxCF-Hofc-ilO).^ -1 ^*-!^))]) 
+ Ind( yi e [min(F- 1 (a fc |d) l ii , - 1 (ojfe|d)) > 

max^-^Ofcl^^-^afcie))]). 
Because (0 — 0) is O p (l/y / n), y/nAz k j = O p (l). It follows that 

\Jn yjn 
Substituting this expression into (11) yields (10). □ 

Corollary A. 2. The previous lemma also applies if Oq is substituted 
for 0, that is, 

■±={m k (0 Q ) - m k {6)) = -^(r4(0o) - m%(0)) + o p {\) 



_Lg^ (90 „^, )+Op(1 , 
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Lemma A. 3. Define 
(12) p k = FiF-^aklGoM - F[F^(a k ^\9 )\9] 
Then, under the conditions stated above, 



F-iK-i|0o) 



f(y\o)d y . 



(13) 



Pk~Pk = -(m* k (6 ) -m* k (6)) + Op(- 
n \n 



Proof. For notational simplicity, define 

G( Ci ^c)=F\F-\c\ 1 )\b\ 

and 



d5 r 



5=7 



Then, noting that m* k (0 o ) = np k = n(G(9 ,9 ,a k ) - G(0 , 9 , a fc _i)), 

(Pk-Pk) ~ -(m* k (9 ) -m%{0)) 
n 

= [G(0 , 0\ a k ) - G(0 O ,0; a k ^)} + [G(9, 9 ;a k ) - G(9, 9 ; 0jfe _i)] - 2p fc 
^^Hi{0Q;a k )(9i — Oqj) — E Hi{Oo\ a k-i)(@i — 0o,i 



+ /",-Hi(0;Qfc)(flo,i — #i) — y^-ffi(0; Qfc-l)(0Q,i — #i 
i i 

^[#i(0 o ;a fe ) - Fi(0;a fe )](^ - O)i ) 

i 

- ^[if,(6> ; ctfc-i) - Hi(9; o fc _i)](fli - o ,i) + O p ( - 

i 

dHi(9 ;a k ) dHi(9 ;a k -i 



+ OJ- 



n 



EE 

ft. i 

a: 1 

n 



Corollary A. 4. 



50 o ,ft 



d9a h 



iflh — &0,h){Qi — ^O.i) + Op I — 

1 n 



□ 



V^Pfc - Pfc) = ~-=(mjfc(0o) - mjfc(0)) + O p ( -= 
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Proof of Theorem 1. Decompose the terms appearing in (3) as fol- 
lows: 

ra fc (fl) -np k _ m k {9) - m k (9) _ m k (9 ) - m k (9) + m k (9 ) - np k 



/np k 



/np k 



/np k 



/npk 



From Lemma A.l and Corollary A. 2, the first two terms on the right-hand 
side of (14) are asymptotically equivalent to 



(15) 



Ei<9<(0)M 







and 



ZidrnKOydOiVo, 



i) 



/np k 



/np k 



Also, (9 — 9) is asymptotically normal with mean and covariance matrix 
equal to the negative inverse of the information matrix [Chen (1985)]. So, 
too, is (9 — 9q), and the two quantities are asymptotically independent [e.g., 
Olver (1974) and Cox and Hinkley (1974)]. 

Following Chernoff and Lehmann (1954), define e to be a K x 1 vector 
with components 

m k {9 ) -np k 



inp k 



and let v be the vector with components 

Vk = Vn(pk ~ Pk)/ V?k- 
It follows from their results that 

(16) i> = D(J + J*)- 1 (D / e + VnA*) + o p (l), 

where J* is the matrix whose (i,j)ih component is 

~dlogg(y\z,9) dlogg(y\z,9) 



E 



d9 j 



g(y\z,9) is the conditional distribution of y given z and 9,3 = D'D is the 
matrix with elements 



K 

E 



1 dpk dp k 



^ Pk d9 a 99 b ' 
and A* is the vector whose ath component is 

nfr{ de a 

From the second corollary, the right-hand side of (16) also describes the 
large sample distribution of (m k (6o) — m k {9))/ \jnp k . 
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Taking r/ = ^/nA* and invoking the central limit theorem, Chernoff and 
Lehmann note that the asymptotic distribution of (e,Tj) is 



(17) 



N 



0. 



-qq' 
J* 



where q is the vector with components \Jp~k- Letting e denote a variable 
having the same distribution as e, and r denote a variable having the same 
distribution as rj, with all four variables distributed independently, it follows 
that R B has the asymptotic distribution of 

(Te + Sr - Te - Srj + e)'(Te + St - Te - Sr] + e), 

where S = D(J + J*) -1 and T = SD'. Noting that D'q = 0, the asymptotic 
distribution of (Te + Sr - Te - Srj + e)' is N(0, 1 - qq'). The result follows. 
□ 



Proof of Corollary 1. Because the proof of this corollary is similar 
to the proof of Theorem 1, only an outline is presented here. 

To begin, note that Lemma A.l and Corollary A. 2 extend to this setting 
if m* k (9) is redefined as 

n 

m%{0) =^E[Ind( yi G [Fr\a k _ x \6l Fr\a k \e)})\. 

Next, Lemma A. 3 applies if (12) is modified so that 

p j)k = FjiF^iaklOoM - F^irtofc-ilflo)!©] 

(18) 

rF-\a k \e Q ) 

= / 3 fMQ)dy, 

where p^ k and related estimates refer to the probability that the jth obser- 
vation falls into the fcth bin. Then 

(19) p j>k - Pj>k = ( Z * ik (e ) - 4,(0)) + o p , 

where 

zl k (9) = E[Ind(y, G [Fr\a k ^\e), Fr\a k \e)])]. 
Corollary A. 4 generalizes to 

1 " 1 n * / i \ 
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Extending Chernoff and Lehmann's (1954) result to the case of noniden- 
tically distributed random variables requires the following modifications of 
the definitions of variables used in the i.i.d. case. Let 

e = (e 1} . . . , e n ) , 



Then 



~ Pj,i z i,K ~ Pj,K 

n K 



EE 



1 dp a r dp, 



a.r 



D 



a=lr=1 Pa,r Odi 86 j 

' i 9pi,x 1 <9pi,i \ 









S0 S 


1 


dpi,K 


1 


dpi,K 


1 


dOi ' 

dp2,i 


1 


dd s 
dp2,i 




dOi ' 




de s 


1 


dp n ,K 


1 


dp n ,K 




y/P^K dd s J 



E 



E 

,a=l 



E 

V3=l 



9log g/3{y\z,0) 
80* 



A* 



1 ^ dloggj(y\z,9) 



n 



00* 



yfnpj;. 



v = D(J + J^-^D'e + v^A*) + op(l). 
The covariance matrix of e may be written 

qiqi' ... 



^ 1 

— i-nxK 

n n 



... qnq^ 



where q* is the vector whose jth component is y/pij ■ Denote the rightmost 
matrix in this equation by Q . Similarly, define r\ = ^/nA* . Then the asymp- 
totic distribution of 77 has mean and covariance matrix equal to 3*/n, and 
is independent of e. 
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Letting r denote the vector with components (z k j(0) — z k j (#))/ (y/ n Pj,k ) > 
it follows from the generalization of Corollary A. 4 that the distribution of 
Pf is asymptotically the same as that of P£. Letting f denote the vector 
with components (z k j(6) — z k j(0)) / (^/npj^k) , then Pf and Pf are, for large 
n, independent and identically distributed. Noting that 

R B = (e - f + f ) / P / P(e - f + f ) 

and that D'Q = 0, some algebra and application of the central limit theorem 
yields the desired result. □ 

Proof of Corollary 2. Expanding the components of R B (6) yields 

, 20 \ m k -np k (6) = m k -np k (6 ) _ n(p k (6) -p k {0 )) _ n(p k (6) - p k {0)) 
sjn y/n \Jn y/n 

Asymptotically, Taylor series expansions show that the second term on the 
right-hand side of this equation has the distribution of Te + Srj described in 
the proof of Theorem 1, while the third term has the distribution of Ts + St. 
The result follows using methodology in the proof of Theorem 1. □ 
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